The efficient implementation of collective communication operations has received much
attention. Initial efforts produced "optimal" trees based on network communication mod- 
els that assumed equal point-to-point latencies between any two processes. This assump- 
tion is violated in most practical settings, however, particularly in heterogeneous systems 
such as clusters of SMPs and wide-area "computational Grids," with the result that col- 
lective operations perform suboptimally. In response, more recent work has focused on 
creating topology-aware trees for collective operations that minimize communication across 
slower channels (e.g., a wide-area network). While these efforts have significant commu- 
nication benefits, they all limit their view of the network to only two layers. We present 
a strategy based upon a multilayer view of the network. By creating multilevel topology- 
aware trees we take advantage of communication cost differences at every level in the 
network. We used this strategy to implement topology-aware versions of several MPI
collective operations in MPICH-G2, the Globus Toolkit-enabled version of the popular
MPICH implementation of the MPI standard. Using information about topology provided 
by MPICH-G2, we construct these multilevel topology-aware trees automatically during 
execution. We present results demonstrating the advantages of our multilevel approach 
by comparing it to the default (topology-unaware) implementation provided by MPICH 
and a topology-aware two-layer implementation. 

Key Words: MPI, collective operations, MPICH-G2, grid computing, Globus 
Toolkit 
1. INTRODUCTION



The problem of building "optimal" communication trees for collective operations 
has received much attention in recent years. The telephone model, which assumes 
that send and receive times are equal and that messages are not packetized, implies 
that the optimal broadcast algorithm uses a binomial tree. Under models that 
expand the telephone model to account for message latency, such as the postal 
or LogP models, the communication topology of an optimal broadcast algorithm
becomes a generalized Fibonacci tree. All of these approaches construct optimal 
trees for collective operations by first modeling the communication characteristics 
of a network with a set of parameters and then building the optimal trees based on 
parameter values and their model. 

Underlying this work is the assumption that the communication times between 
all process pairs in the computation are equal. While this is a reasonable approx- 
imation when the entire computation is performed on a single machine, it is not 
reasonable when the computation is executed on a cluster of symmetric multiprocessors
(SMPs) in a local-area network, or worse, in a computational Grid
environment, in which multiple parallel computers are connected by local-area, 
campus-area, or even wide-area networks. Rapid improvements in network per- 
formance have engendered considerable interest in parallel computing in the last 
context, as evidenced by experiments and initiatives such as the I-WAY, National
Technology Grid, Information Power Grid, and TeraGrid.

Under these circumstances the trees produced by the conventional models per- 
form suboptimally. In such heterogeneous environments, communication costs over 
different links can differ by an order of magnitude or more. In these situations, 
topology-aware algorithms can dramatically improve performance. For example,
in the case of N processors distributed into two clusters, a traditional reduction
algorithm may generate O(log N) intercluster messages, while a topology-aware
algorithm generates only 1, for a cost saving of a factor of O(log N) if intercluster
message costs dominate.

Previous work has demonstrated that topology-aware collective operations
can indeed reduce communication costs by reducing the amount of communication
performed over slow channels. However, this work limited the depth
of network stratification to only two levels: other processors are either near or
far. In earlier work we compared a prototype of our multilevel approach to the
topology-unaware binomial tree algorithm distributed with MPICH and to MagPIe, one of
the topology-aware two-level techniques. In that prototype we "guessed" which
computers shared a local network by inspecting their fully qualified domain names,
and then represented our multilevel clustering of processes with a sequence
of hidden communicators inside MPI communicators.

In this paper we present a much improved refinement of that prototype that 
allows collective operations to exploit knowledge concerning the structure of a mul- 
tilevel network, in which the neighbors are processors that are categorized according 
to their expected point-to-point communication characteristics. The identification 
of which processes share a local network is now a simple matter of users providing 
values for selected environment variables. Additionally, the use of hidden communicators
to represent the multilevel clustering has been replaced by integer vectors.
The use of hidden communicators required us to implement each collective operation
as a sequence of collective operations; for example, an MPI_Bcast was implemented
as a sequence of MPI_Bcasts over each of the hidden communicators in
turn, which typically resulted in the use of binomial trees at each level. By replacing
the hidden communicators with integer vectors we are now free to implement
collective operations using point-to-point operations over any tree we create.

FIG. 1 An example of a Grid computation involving 10 processes on one IBM SP
at SDSC and another 10 processes distributed evenly across two SGI Origin2000s
(O2Ka and O2Kb) at NCSA.

To permit experimental studies, we have implemented our multilevel approach 
for five of the collective operations supported by the Message Passing Interface 
(MPI) standard: MPI_Bcast, MPI_Reduce, MPI_Barrier, MPI_Gather, and
MPI_Scatter. We use MPICH-G2, the successor to MPICH-G, which is
based on the popular MPICH implementation of the MPI standard. MPICH-G2
uses services provided by the Globus Toolkit, or simply Globus, to support
execution in heterogeneous and distributed environments. This use of MPICH-G2
enables experimentation within realistic wide-area environments that would not 
otherwise be easily accessible. 

In the sections that follow, we describe our multilevel topology approach. Then, 
we present experimental results that illustrate the benefits of our multilevel
approach by comparing it with (1) the topology-unaware implementation currently
distributed with MPICH and (2) MagPIe, one of the topology-aware two-level
implementations of collective operations. We briefly discuss other recent topology- 
aware and optimized collective operations efforts and conclude with a discussion of 
future work. 

2. MULTILEVEL TOPOLOGY-AWARE APPROACH 

Figure 1 depicts an MPI application involving 20 processes distributed over
three machines located at the San Diego Supercomputer Center (SDSC) and the
National Center for Supercomputing Applications (NCSA). We depict 10 processes
on the IBM SP at SDSC and 5 processes on each of two Origin2000s, O2Ka and
O2Kb, at NCSA. The slowest communication is between sites, which uses TCP
over a wide-area network. Communication between the O2Ks at NCSA, which uses
TCP over their local-area network, is faster, and communication within each
machine is, of course, the fastest.

In the remainder of this section we describe a broadcast using first the
topology-unaware implementation currently distributed with MPICH, then a 2-level
topology-aware approach, and finally our multilevel topology-aware broadcast.

FIG. 2 The binomial trees $B_0$ through $B_3$.

2.1. A Topology-Unaware Broadcast

Topology-unaware implementations of broadcast, including the one distributed 
with MPICH, often make the simplifying assumption that the communication times 
between all process pairs in the computation are equal. Under this assumption the 
broadcast is often implemented by using a binomial tree. 

A binomial tree $B_k$ is an ordered tree (i.e., children of each node are ordered) of
order $k \ge 0$ defined recursively. As shown in Figure 2, the binomial tree $B_0$ consists
of a single node. The binomial tree $B_k$ ($k > 0$) has a root with $k$ children where
the $i$-th child ($1 \le i \le k$) is the root of the binomial tree $B_{k-i}$. Figure 2 depicts the
binomial trees $B_0$ through $B_3$.
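
The recursive definition can be made concrete with a short sketch. The Python
below is illustrative only: the names binomial_tree, size, and height are ours
and are not taken from MPICH. A tree is represented as the list of its child
subtrees, attached in the order B_{k-1}, ..., B_0.

```python
def binomial_tree(k):
    """Return B_k as a nested list: a node is the list of its child subtrees."""
    if k < 0:
        raise ValueError("order must be non-negative")
    # The root of B_k has k children; the i-th child (1 <= i <= k) roots B_{k-i}.
    return [binomial_tree(k - i) for i in range(1, k + 1)]


def size(tree):
    """Count the nodes of a nested-list tree."""
    return 1 + sum(size(child) for child in tree)


def height(tree):
    """Number of edges on the longest root-to-leaf path."""
    return 0 if not tree else 1 + max(height(child) for child in tree)


if __name__ == "__main__":
    for k in range(4):
        t = binomial_tree(k)
        print(k, size(t), height(t))
```

Under this construction B_k has 2^k nodes and height k, which is the property
a binomial broadcast relies on: each additional round doubles the number of
processes that hold the message.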

When communication times between all process pairs in the computation are 
equal and have relatively low latency, Bar-Noy and Kipnis show that implementing 
a broadcast with a binomial tree has the desirable property that all processes will 
complete the broadcast at approximately the same time, thus achieving proper load
balancing.

2.2. A 2-Level Topology-Aware Broadcast

Existing 2-level topology-aware approaches cluster processes into groups.
The two natural choices for the machines depicted in Figure 1 are to cluster the
processes based either on machine boundaries, creating three groups (the IBM SP,
O2Ka, and O2Kb), or on site boundaries, creating two groups (SDSC and NCSA). While
both are reasonable choices and would improve performance when compared with
the topology-unaware binomial tree distributed with MPICH, both choices ignore
the disparity in network performance between the local- and wide-area networks.
Consider, for example, a broadcast rooted at one of the processes at SDSC.
Figure 3(a) depicts the broadcast tree of the 2-level approach when the processes are
clustered on machine boundaries. The broadcast starts with the SDSC root process
sending messages to designated processes on each of the O2Ks at NCSA, resulting
in two messages travelling across the wide-area network, and concludes with
broadcasts within each machine. By contrast, Figure 3(b) depicts the broadcast tree
when the processes are clustered on site boundaries. In this case the root at SDSC
sends a single message across the wide-area network to a process on one of the two
O2Ks at NCSA and concludes with a broadcast within the IBM SP and a
simultaneous broadcast across all the processes at NCSA, which would typically
require multiple messages to travel across NCSA's local network.
FIG. 3 An example of two 2-level topology-aware broadcast trees rooted at SDSC
spanning 2 Origin2000s (O2Ka and O2Kb) at NCSA and an IBM SP at SDSC: (a)
clustering processes on machine boundaries and (b) clustering on site boundaries.



2.3. A Multilevel Topology-Aware Broadcast

The multilevel topology-aware approach we present minimizes messaging across
the slowest links at each level by clustering the processes at the wide-area level into
site groups, and then within each site group, clustering processes at the local-area
level into machine groups. Using the same broadcast example from Section 2.2,
we depict in Figure 4 the broadcast tree used by a multilevel approach. Here
the broadcast starts with the SDSC root process sending a single message across
the wide-area network to one of the processes at NCSA; in Figure 4 we depict
a process on O2Ka. The broadcast continues with the receiving process on O2Ka
sending a single message across NCSA's local network to a process on O2Kb, and the
entire broadcast concludes with broadcasts within each machine. This multilevel
clustering minimizes messaging over the slower wide- and local-area networks.
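
The difference between the 2-level and multilevel trees can be checked by
counting the messages that cross each kind of link. The sketch below is a
minimal model, not MPICH-G2 code: build_tree, classify, and the rank layout
(ranks 0-9 on the SP, 10-14 on O2Ka, 15-19 on O2Kb) are our assumptions, each
process is described by a tuple of cluster ids (outermost level first), and
the innermost group uses a simple flat fan-out in place of a binomial tree.

```python
from collections import Counter, defaultdict


def build_tree(vectors, root, depth=0):
    """Return (parent, child) edges of a broadcast tree rooted at `root`.

    `vectors` maps each rank to a tuple of cluster ids, outermost level first.
    """
    ranks = sorted(vectors)
    if depth == len(vectors[root]):
        # All ranks share every cluster id: the representative sends directly.
        # (A flat fan-out keeps the sketch short; a binomial tree also works.)
        return [(root, r) for r in ranks if r != root]
    groups = defaultdict(list)
    for r in ranks:
        groups[vectors[r][depth]].append(r)
    edges = []
    for members in groups.values():
        rep = root if root in members else members[0]
        if rep != root:
            edges.append((root, rep))  # one message into each remote group
        sub = {r: vectors[r] for r in members}
        edges += build_tree(sub, rep, depth + 1)
    return edges


def classify(edges, location):
    """Count edges by the slowest link crossed, given (site, machine) per rank."""
    counts = Counter()
    for a, b in edges:
        if location[a][0] != location[b][0]:
            counts["wan"] += 1
        elif location[a][1] != location[b][1]:
            counts["lan"] += 1
        else:
            counts["local"] += 1
    return counts


# Assumed layout for the Figure 1 example.
location = {r: ("SDSC", "SP") for r in range(10)}
location.update({r: ("NCSA", "O2Ka") for r in range(10, 15)})
location.update({r: ("NCSA", "O2Kb") for r in range(15, 20)})

by_machine = {r: (loc[1],) for r, loc in location.items()}  # 2-level, machines
by_site = {r: (loc[0],) for r, loc in location.items()}     # 2-level, sites
multilevel = dict(location)                                 # site, then machine

if __name__ == "__main__":
    for name, vec in [("machine", by_machine), ("site", by_site),
                      ("multilevel", multilevel)]:
        print(name, dict(classify(build_tree(vec, root=0), location)))
```

Because every process can run build_tree on the same inputs and obtain the same
edge list, a tree of this kind can be constructed without any communication.
In this model only the multilevel clustering keeps both the wide-area and the
local-area message counts at one.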

3. MULTILEVEL TOPOLOGY-AWARE APPROACH IN MPICH-G2 

In this section we describe our implementation of multilevel topology-aware 
collective operations in the Globus Toolkit-based MPICH-G2. For illustrative pur- 
poses, we discuss our implementation of MPI_Bcast in detail. 

3.1. RSL Specification of Topology 

MPICH-G2 uses the Globus Toolkit's Resource Specification Language (RSL)
to describe the resources required to run an application. Users write RSL scripts,
which identify resources (e.g., computers) and specify requirements (e.g., number of
CPUs, memory, execution time) and parameters (e.g., location of executables,
command line arguments, environment variables) for each. An RSL script can be used
as the user interface to globusrun, an upper-level Globus service that first authenticates
the user by using the Grid Security Infrastructure (GSI) and then schedules
and monitors the job across the various machines by using two other Globus Toolkit
services: the Dynamically-Updated Request Online Coallocator (DUROC) and
Grid Resource Allocation and Management (GRAM). RSL is designed to be an
easy-to-use language to describe multiresource multisite jobs while hiding all the
site-specific details associated with requesting such resources.

FIG. 4 An example of a multilevel topology-aware broadcast tree rooted at SDSC
spanning 2 Origin2000s (O2Ka and O2Kb) at NCSA and an IBM SP at SDSC.

Figure 5 depicts an RSL script for an MPICH-G2 application intended to run
on the computational Grid depicted in Figure 1. It depicts a job as a set of three
subjobs, where each subjob is associated with a particular resource (in our example,
a computer). Subjobs define a natural machine-boundary partitioning of the
processes in MPI_COMM_WORLD and are sufficient for a 2-level machine-boundary
clustering of the processes. To achieve a multilevel clustering, the user must identify
those machines that are on the same local network by specifying a value for an
MPICH-G2-defined environment variable GLOBUS_LAN_ID, as depicted in the RSL
script in Figure 6. Specifying the same value (NCSAlan) in the second and third
subjobs instructs MPICH-G2 to cluster these two machines into the same local-area
network group. This same technique can be used to cluster many subjobs in the
same local-area network group while simultaneously creating multiple local-area
network groups through the assignment of multiple distinct GLOBUS_LAN_ID values.
This simple specification (the only difference between Figures 5 and 6) is all
that is required to create multilevel topology-aware clustering of the processes.
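
How subjobs and GLOBUS_LAN_ID values might translate into a per-process
clustering can be sketched as follows. This is a hypothetical illustration,
not MPICH-G2's actual data structure: the function cluster_subjobs, the
(lan_group, machine) vector layout, the assignment of ranks to subjobs in
order, and the rule that a subjob without GLOBUS_LAN_ID forms its own
local-area group are all assumptions made for the example.

```python
def cluster_subjobs(subjobs):
    """Map each rank to a (lan_group, machine) pair of small integers.

    `subjobs` is a list of dicts with key "count" and, optionally,
    "GLOBUS_LAN_ID". Ranks are assigned to subjobs in order (assumption).
    ASSUMPTION: a subjob without GLOBUS_LAN_ID forms its own LAN group.
    """
    lan_index = {}
    vectors = {}
    rank = 0
    for machine, job in enumerate(subjobs):
        # Subjobs naming the same LAN id share a key; others get a unique one.
        key = job.get("GLOBUS_LAN_ID", ("unnamed", machine))
        lan = lan_index.setdefault(key, len(lan_index))
        for _ in range(job["count"]):
            vectors[rank] = (lan, machine)
            rank += 1
    return vectors


# The three subjobs of the multilevel RSL script.
subjobs = [
    {"count": 10},
    {"count": 5, "GLOBUS_LAN_ID": "NCSAlan"},
    {"count": 5, "GLOBUS_LAN_ID": "NCSAlan"},
]
vectors = cluster_subjobs(subjobs)

if __name__ == "__main__":
    for r in (0, 10, 15):
        print(r, vectors[r])
```

With the two NCSA subjobs sharing the value NCSAlan, their processes receive
the same local-area group id but different machine ids, while the SDSC
processes fall into a separate group.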

The multilevel clustering information specified in RSL (i.e., processes gathered 
first into machine groups and then local network groups composed of machine 
groups) creates a multilevel grouping of the processes in MPI_COMM_WORLD and is
distributed to all the processes during MPICH-G2 bootstrapping to be stored within
MPI_COMM_WORLD on each process. When new communicators are created (e.g., via
+
( &(resourceManagerContact="sp.npaci.edu")
   (count=10)
   (jobtype=mpi)
   (label="subjob 0")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
   (directory=/homes/users/smith)
   (executable=/homes/users/smith/myapp)
)
( &(resourceManagerContact="o2ka.ncsa.uiuc.edu")
   (count=5)
   (jobtype=mpi)
   (label="subjob 1")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1))
   (directory=/users/smith)
   (executable=/users/smith/myapp)
)
( &(resourceManagerContact="o2kb.ncsa.uiuc.edu")
   (count=5)
   (jobtype=mpi)
   (label="subjob 2")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 2))
   (directory=/users/smith)
   (executable=/users/smith/myapp)
)

FIG. 5 An RSL script for an MPICH-G2 application running on three machines
that facilitates 2-level process clustering.

+
( &(resourceManagerContact="sp.npaci.edu")
   (count=10)
   (jobtype=mpi)
   (label="subjob 0")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
   (directory=/homes/users/smith)
   (executable=/homes/users/smith/myapp)
)
( &(resourceManagerContact="o2ka.ncsa.uiuc.edu")
   (count=5)
   (jobtype=mpi)
   (label="subjob 1")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1)
                (GLOBUS_LAN_ID NCSAlan))
   (directory=/users/smith)
   (executable=/users/smith/myapp)
)
( &(resourceManagerContact="o2kb.ncsa.uiuc.edu")
   (count=5)
   (jobtype=mpi)
   (label="subjob 2")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 2)
                (GLOBUS_LAN_ID NCSAlan))
   (directory=/users/smith)
   (executable=/users/smith/myapp)
)

FIG. 6 An RSL script for an MPICH-G2 application running on three machines
that facilitates multilevel process clustering.
MPI_Comm_split), MPICH-G2 propagates the relevant multilevel clustering information
to the newly created communicator so that all communicators in MPICH-G2
have the multilevel clustering information pertaining to their process groups. As an
interesting side effect we have made this multilevel topology information available
to MPI applications through existing MPI communicator caching idioms. A full
description of MPICH-G2's topology discovery mechanism is given elsewhere.

3.2. MPICH-G2's Multilevel Topology-Aware Broadcast

A multilevel topology-aware clustering of processes is not sufficient in itself
to allow the construction of a broadcast tree such as that depicted in Figure 4:
MPICH-G2 also needs to know which process is the root of the broadcast.
Construction of the multilevel topology-aware tree is therefore deferred until the
application calls a collective operation. At that time each process simultaneously
and independently (i.e., without communication) constructs an identical tree based
on the multilevel process grouping found in the communicator and the parameters 
passed (e.g., identifying the root process of a broadcast) to the collective operation. 

One benefit of using a multilevel topology-aware tree to implement a collective 
operation is that we are free to select different subtree topologies at each level. 
For example, a multilevel broadcast tree can start with a broadcast from the root 
to selected processes at each site across a wide-area network, followed by broad- 
casts at each site to selected processes on each machine across the local networks, 
and concluding with broadcasts within each machine. We have the freedom to use 
different broadcast topologies at each stage in the sequence. Bar-Noy and Kipnis 
show that in high-latency networks (e.g., a wide-area network) the optimal broad- 
cast topology is a flat tree in which the root sends the data to all other processes 
directly, while in a low-latency network (e.g., intramachine messaging), the optimal 
broadcast topology is a binomial tree. We take advantage of these findings and
the flexibility of our multilevel approach in our implementation of MPI_Bcast by 
using a flat broadcast tree at the initial wide-area level and binomial trees at the 
local-area and intramachine levels. 
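
The choice of a different tree shape per level can be illustrated with a small
helper that returns the children of a process within one group. This is a
sketch under our own assumptions rather than the MPICH-G2 source: ranks are
relative to the group's root (which is relative rank 0), the level labels
"wan", "lan", and "local" are hypothetical names, and the binomial rule follows
a common bit-mask formulation in which the root sends to its farthest child
first.

```python
def binomial_children(rel, n):
    """Children of relative rank `rel` (root is 0) in a binomial tree over n ranks."""
    mask = 1
    while mask < n and not (rel & mask):
        mask <<= 1
    # `mask` is now the lowest set bit of rel, or the first power of two >= n
    # when rel is the root.
    children = []
    mask >>= 1
    while mask > 0:
        if rel + mask < n:
            children.append(rel + mask)
        mask >>= 1
    return children


def flat_children(rel, n):
    """In a flat tree the root sends to every other rank directly."""
    return list(range(1, n)) if rel == 0 else []


def children(rel, n, level):
    """Flat tree across the wide area, binomial trees at the lower levels."""
    if level == "wan":
        return flat_children(rel, n)
    return binomial_children(rel, n)


if __name__ == "__main__":
    for rel in range(8):
        print(rel, children(rel, 8, "wan"), children(rel, 8, "local"))
```

In both shapes every non-root rank appears as a child exactly once, so the two
rules can be mixed freely across levels without losing or duplicating a
message.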

In the next section we present results demonstrating the advantages of our multilevel
approach by comparing it with the default (topology-unaware) implementation
provided by MPICH and a topology-aware two-layer implementation.

4. EXPERIMENTAL RESULTS 

To demonstrate the advantages of our multilevel approach, we examine its ef- 
fects on MPI_Bcast. The MPICH implementation of MPI_Bcast is based on bino- 
mial trees; hence, in a distributed heterogeneous environment like a computational 
Grid its performance is acutely sensitive to the distribution of the processes and 
the root of the broadcast. For example, in an application using $P = 2^k$ processes
distributed evenly across $C = 2^i$, $0 < i < k$, clusters, a broadcast implemented using
a binomial tree propagates the message down its longest path using at least $\log_2 C$
intercluster messages and $\log_2 (P/C)$ intracluster messages. In contrast, under certain
intercluster network performance conditions described by Bar-Noy and Kipnis in
their postal model, our multilevel method could be used to send 1 intercluster
message and $\log_2 (P/C)$ intracluster messages. Assuming an intercluster latency $l_s$ sec
and bandwidth $b_s$ Kb/sec, and an intracluster latency $l_f$ sec and bandwidth $b_f$
Kb/sec, broadcasting a message of $N$ Kb using the binomial tree conservatively
takes $O\big((\log C)(l_s + N/b_s) + (\log (P/C))(l_f + N/b_f)\big)$, whereas broadcasting the same
message using our multilevel method takes only $O\big((l_s + N/b_s) + (\log (P/C))(l_f + N/b_f)\big)$.

For (each message size M)
    MPI_Barrier(MPI_COMM_WORLD)
    if (MPI_COMM_WORLD rank == 0)
        t0 = get_time()
    For (r = 0; r < Nprocs; r++)
        MPI_Bcast(root=r to MPI_COMM_WORLD message size M)
        ack_barrier()
    if (MPI_COMM_WORLD rank == 0)
        t1 = get_time()
        report message size M, time t1-t0

FIG. 7 The broadcast timing application.
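
The two cost estimates for the binomial and multilevel broadcasts can be
compared numerically. The sketch below simply evaluates the expressions inside
the O(...) bounds with constant factors ignored; the function names and all
parameter values are hypothetical and chosen only to resemble a slow wide-area
link and a fast intracluster link.

```python
from math import log2


def binomial_cost(P, C, N, ls, bs, lf, bf):
    """(log C)(ls + N/bs) + (log(P/C))(lf + N/bf), in seconds."""
    return log2(C) * (ls + N / bs) + log2(P / C) * (lf + N / bf)


def multilevel_cost(P, C, N, ls, bs, lf, bf):
    """(ls + N/bs) + (log(P/C))(lf + N/bf), in seconds."""
    return (ls + N / bs) + log2(P / C) * (lf + N / bf)


# Hypothetical parameters: 64 processes in 4 clusters, a 100 Kb message,
# 50 ms / 1000 Kb/sec between clusters, 0.1 ms / 100000 Kb/sec within one.
params = dict(P=64, C=4, N=100.0, ls=0.05, bs=1000.0, lf=0.0001, bf=100000.0)

if __name__ == "__main__":
    print("binomial:  ", binomial_cost(**params))
    print("multilevel:", multilevel_cost(**params))
```

The two estimates differ by (log C - 1)(l_s + N/b_s), so the advantage of the
multilevel method grows with the number of clusters and with the cost of an
intercluster message, and vanishes when there are only two clusters.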

We wrote a small MPI application (depicted in Figure 7) that times the broadcasts
of messages of increasing size. To represent a broadcast with an arbitrary
root, we timed how long it would take to broadcast each message of size M as each
process in MPI_COMM_WORLD took its turn as the root. Also, in order to eliminate any
potential pipelining that might occur between consecutive broadcasts, we inserted
a barrier (ack_barrier()) after each broadcast in which all processes other than
rank 0 MPI_Send an ACK message to process 0 and then wait to MPI_Recv a GO
message. Process 0, after MPI_Recv'ing the ACK message from all the other processes,
MPI_Send's a GO message to each of the other processes, one at a time. We
chose to write our own barrier rather than calling MPI_Barrier because we have 
reimplemented MPI_Barrier to reflect multilevel topology and we wished these 
tests to reflect the differences only in the broadcast implementations. 

We conducted experiments running the MPI application depicted in Figure 7 on
three computers: the IBM SP at the San Diego Supercomputer Center (SDSC-SP)
and the IBM SP (ANL-SP) and SGI Origin2000 (ANL-O2K) at Argonne National
Laboratory. We compare our multilevel topology approach to the binomial tree
provided by MPICH and include comparisons to the 2-level approach provided
by MagPIe. We ran the application four times, each time using 16 processes on
each of the three computers. These results are depicted in Figure 8. The curves
labeled "MagPIe-machine" and "MagPIe-site" represent two runs using MagPIe
version 2.0.1, each time with a different cluster definition. In our first MagPIe
run ("MagPIe-machine") we defined three clusters, one for each computer, of 16
processes each. In our second MagPIe run ("MagPIe-site") we defined two clusters:
an ANL cluster comprising the two ANL machines having 32 processes and an
SDSC cluster comprising the SDSC-SP having only 16 processes.

Figure 8 shows there are significant benefits to the multilevel approach when
compared with a simple binomial tree and even when compared with a 2-level 
approach as implemented by MagPIe. A multilevel view of the network allows an 
application to avoid slower channels at each level. In our experiments, the broadcast 
is optimized by sending one message across the wide-area network, then one message 
across the local-area network, and then many messages within each computer. 
FIG. 8 Original MPICH broadcast vs. topology-aware MPICH broadcast vs. MagPIe
broadcast running 16 processes on the IBM SP at SDSC and 16 processes on
each of the IBM SP and SGI Origin2000 at ANL.
5. RELATED WORK



Previous efforts have focused on creating "optimal" trees for collective opera- 
tions where point-to-point communications are not necessarily equal between any 
two processes. Husbands and Hoe present MPI-StarT, an MPI implementation
for a cluster of SMPs interconnected by a high-performance interconnect. They
report significant improvements after modifying the MPICH broadcast algorithm,
which uses binomial trees. Their modifications use information that describes their
cluster topology to minimize intercluster communication during collective operations.
MagPIe is another MPI system designed to construct collective operation
trees in heterogeneous communication environments. MagPIe recognizes a two-layer 
communication network that distinguishes between local- and wide-area communi- 
cation. By minimizing wide-area communication, much in the same way MPI-StarT 
minimizes intercluster communication, MagPIe has seen significant improvements 
in all the MPI collective operations. 

Both efforts have produced impressive results and clearly demonstrate that there 
are significant advantages to implementing collective operations in a topology-aware 
manner. However, both limit their view of the network to only two layers:
MPI-StarT distinguishes between intra- and intercluster communication within the same
local-area network, and MagPIe distinguishes between local- and wide-area communication.
There are opportunities for further optimization by using trees that stratify the 
network deeper than two layers. 

Van de Geijn et al. show the advantages of implementing collective operations
by segmenting and pipelining messages when communicating over relatively
slower channels (e.g., TCP over local- and wide-area networks).

Kielman et al. extend MagPIe by incorporating van de Geijn's pipelining
idea through a technique they call Parameterized LogP (PLogP), which is an
extension of the LogP model presented by Culler et al. In this extension, MagPIe
still recognizes only a two-layer communication network, but through parameter- 
ized studies of the network, the researchers determine "optimal" packet sizes. This 
technique works well for applications that always run on the same computational 
grid having relatively stable performance, but requires retuning when moving the 
application from one computing environment or network to another. 

6. FUTURE WORK 

We have implemented five of the MPI collective operations in a topology-aware 
multilevel manner in MPICH-G2. Encouraged by our initial results, we plan to 
upgrade MPICH-G2's remaining MPI collective operations in a similar manner. 

Our general strategy implements a collective operation by first stratifying the 
network into multiple levels and then minimizing the communication across the 
slowest channels. In doing so, however, we may encounter a tree that has mul- 
tiple siblings at a particular level, for example, many sites connected across the 
wide-area network or many machines at a particular site. When this situation hap- 
pens, we implement the collective operation at that level using a binomial tree at 
all but the wide-area network level. Unfortunately, a binomial tree is not always 
the best choice. Bar-Noy and Kipnis show that the shape of a collective operation 
tree depends heavily on the point-to-point communication characteristics of the 
send/receive primitives on which it is implemented. Their model incorporates a 
latency parameter $\lambda \ge 1$. They show that for low latencies (for example,
communication within a single machine), the optimal broadcast tree is a binomial tree, but
for higher latencies (for example, communication across a wide-area network), the
optimal broadcast tree becomes flatter. We will investigate ways to select better, if
not optimal, collective operation trees by choosing those that respect the different 
communication characteristics at each level of our multilevel view. 

The pipelining techniques presented by van de Geijn et al. can be used at 
each of the levels in MPICH-G2's multilevel topology-aware collective operations. 
Using techniques similar to Kielman's PLogP method, we will develop methods to 
determine the appropriate packet sizes with respect to network performance at each 
level of our multilevel view. 

7. SUMMARY 

As Grid computations become increasingly prevalent, the need for topology- 
aware collective operations also increases. We have a version of MPICH-G2 that 
implements five collective operations in a multilevel topology-aware manner. We 
have shown, at least for MPI_Bcast, that when compared with the binomial tree 
provided by MPICH and the 2-level approach provided by MagPIe there are significant
advantages to executing collective operations using a multilevel view of the
network. Through a simple process of identifying machines that are common to a 
local-area network, we have provided a means by which an MPI application may 
take advantage of the multilevel topology-aware algorithms without requiring code 
modifications or special functions. 